Load a Sheet, assume there's no logic, we'll add the logic

You'll have run jupyter notebook at the command line or via the Windows Anaconda tool.

In this demo we add logic in place of Excel cell functions and export a finished sheet for comparison.



In [18]:

    
import pandas as pd

df = pd.read_excel("sheet_1_without_simple_logic.xls")
print(df)
# note the NaN (not-a-number) cells when we have no value









    



   Feature1  Feature2  DecisionF1F2         TVShow  Decision2  Decision3  \
0       0.6       0.6           NaN      Hollyoaks        NaN        NaN   
1       0.4       0.6           NaN      hollyoaks        NaN        NaN   
2       0.3       0.4           NaN     Hollyoaks         NaN        NaN   
3       0.9       0.8           NaN      hollyoaks        NaN        NaN   
4       0.9       0.8           NaN     holly-oaks        NaN        NaN   
5       0.9       0.8           NaN  best TV shows        NaN        NaN   

                    Comment  
0                       NaN  
1                       NaN  
2  trailing-space on TVShow  
3                       NaN  
4        badly-spelt TVShow  
5                       NaN



In [19]:

    
df.head(10) # this creates a Table view (non-interactive but prettier)









    Out[19]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      NaN
      Hollyoaks
      NaN
      NaN
      NaN
    
    
      1
      0.4
      0.6
      NaN
      hollyoaks
      NaN
      NaN
      NaN
    
    
      2
      0.3
      0.4
      NaN
      Hollyoaks
      NaN
      NaN
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      NaN
      hollyoaks
      NaN
      NaN
      NaN
    
    
      4
      0.9
      0.8
      NaN
      holly-oaks
      NaN
      NaN
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      NaN
      best TV shows
      NaN
      NaN
      NaN



In [20]:

    
print("Column names:", df.columns)









    



Column names: Index(['Feature1', 'Feature2', 'DecisionF1F2', 'TVShow', 'Decision2',
       'Decision3', 'Comment'],
      dtype='object')



In [21]:

    
print("Information about each row including data types:")
print("(note - type 'object' is catch-all that includes strings)")
df.info()









    



Information about each row including data types:
(note - type 'object' is catch-all that includes strings)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 7 columns):
Feature1        6 non-null float64
Feature2        6 non-null float64
DecisionF1F2    0 non-null float64
TVShow          6 non-null object
Decision2       0 non-null float64
Decision3       0 non-null float64
Comment         2 non-null object
dtypes: float64(5), object(2)
memory usage: 384.0+ bytes



In [22]:

    
print("\nWe can extract a column of data as a Series object:")
print(df['Feature1'])









    



We can extract a column of data as a Series object:
0    0.6
1    0.4
2    0.3
3    0.9
4    0.9
5    0.9
Name: Feature1, dtype: float64



In [23]:

    
row = df.ix[0]
print("\nWe can extract a row as a Python dictionary:")
print(row)









    



We can extract a row as a Python dictionary:
Feature1              0.6
Feature2              0.6
DecisionF1F2          NaN
TVShow          Hollyoaks
Decision2             NaN
Decision3             NaN
Comment               NaN
Name: 0, dtype: object



In [24]:

    
print("\nRow items, e.g. Feature1={feature1}".format(feature1=row['Feature1']))









    



Row items, e.g. Feature1=0.6000000000000001



In [25]:

    
def decision_f1_f2(row):
    feature_1 = row['Feature1']
    feature_2 = row['Feature2']
    if feature_1 > 0.5:
        if feature_2 > 0.5:
            return True
    return False

# we'll use apply on the entire DataFrame, axis=1 means row-wise (not column-wise)
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
df.apply(decision_f1_f2, axis=1)
# note this doesn't change the DataFrame, it generates a new separate Series
# and here we just print it and then discard it









    Out[25]:





0     True
1    False
2    False
3     True
4     True
5     True
dtype: bool



In [27]:

    
df.head(10) # confirm we haven't modified the DataFrame









    Out[27]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      NaN
      Hollyoaks
      NaN
      NaN
      NaN
    
    
      1
      0.4
      0.6
      NaN
      hollyoaks
      NaN
      NaN
      NaN
    
    
      2
      0.3
      0.4
      NaN
      Hollyoaks
      NaN
      NaN
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      NaN
      hollyoaks
      NaN
      NaN
      NaN
    
    
      4
      0.9
      0.8
      NaN
      holly-oaks
      NaN
      NaN
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      NaN
      best TV shows
      NaN
      NaN
      NaN



In [29]:

    
# we can assign the result back to the DataFrame as a new column
new_result = df.apply(decision_f1_f2, axis=1)
df['DecisionF1F2'] = new_result
df.head(10)









    Out[29]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      True
      Hollyoaks
      NaN
      NaN
      NaN
    
    
      1
      0.4
      0.6
      False
      hollyoaks
      NaN
      NaN
      NaN
    
    
      2
      0.3
      0.4
      False
      Hollyoaks
      NaN
      NaN
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      True
      hollyoaks
      NaN
      NaN
      NaN
    
    
      4
      0.9
      0.8
      True
      holly-oaks
      NaN
      NaN
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      True
      best TV shows
      NaN
      NaN
      NaN



In [31]:

    
def decision_tvshow(cell):
    cleaned_cell = cell
    cleaned_cell = cleaned_cell.lower() # lowercase
    is_hollyoaks = cleaned_cell == "hollyoaks"
    return is_hollyoaks

# we can assign the result back to the DataFrame as a new column
df['Decision2'] = df['TVShow'].apply(decision_tvshow)
df.head(10)









    Out[31]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      True
      Hollyoaks
      True
      NaN
      NaN
    
    
      1
      0.4
      0.6
      False
      hollyoaks
      True
      NaN
      NaN
    
    
      2
      0.3
      0.4
      False
      Hollyoaks
      False
      NaN
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      True
      hollyoaks
      True
      NaN
      NaN
    
    
      4
      0.9
      0.8
      True
      holly-oaks
      False
      NaN
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      True
      best TV shows
      False
      NaN
      NaN

Add a strip to remove whitespace from the "Hollyoaks " example



In [32]:

    
def decision_tvshow_is_hollyoaks(cell):
    cleaned_cell = cell
    cleaned_cell = cleaned_cell.lower() # lowercase
    cleaned_cell = cleaned_cell.strip() # remove superflous whitespace
    is_hollyoaks = cleaned_cell == "hollyoaks"
    return is_hollyoaks

# we can assign the result back to the DataFrame as a new column
df['Decision2'] = df['TVShow'].apply(decision_tvshow_is_hollyoaks)
df.head(10)









    Out[32]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      True
      Hollyoaks
      True
      NaN
      NaN
    
    
      1
      0.4
      0.6
      False
      hollyoaks
      True
      NaN
      NaN
    
    
      2
      0.3
      0.4
      False
      Hollyoaks
      True
      NaN
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      True
      hollyoaks
      True
      NaN
      NaN
    
    
      4
      0.9
      0.8
      True
      holly-oaks
      False
      NaN
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      True
      best TV shows
      False
      NaN
      NaN

Combine DecisionF1F2 and Decision2



In [34]:

    
# use a different way to access the columns (using .colname - this only works for
# easy-to-read ASCII names, it won't work with funny characters)
# and use the logical and (&) to do pairwise logic, assigning the result to our new column
df['Decision3'] = df.DecisionF1F2 & df.Decision2
df.head(10)









    Out[34]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      True
      Hollyoaks
      True
      True
      NaN
    
    
      1
      0.4
      0.6
      False
      hollyoaks
      True
      False
      NaN
    
    
      2
      0.3
      0.4
      False
      Hollyoaks
      True
      False
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      True
      hollyoaks
      True
      True
      NaN
    
    
      4
      0.9
      0.8
      True
      holly-oaks
      False
      False
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      True
      best TV shows
      False
      False
      NaN

Output a new XLS



In [35]:

    
writer = pd.ExcelWriter('sheet_1_with_added_logic_generated_via_pandas.xlsx', engine='xlsxwriter')
df.to_excel(writer, index=False, sheet_name='Our New Sheet')
workbook = writer.book
worksheet = writer.sheets['Our New Sheet']
writer.save()
# note we could add lots of conditional formatting for Excel via:
# http://pbpython.com/improve-pandas-excel-output.html

Let's be clever - we'll do some text processing

We'll try the external fuzzywuzzy library to do some very simple text processing.

If needed install using conda install -c wpb fuzzywuzzy.



In [47]:

    
from fuzzywuzzy import fuzz
# http://pbpython.com/excel-pandas-comp.html side reading on fuzzywuzzy
# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/ other fuzzywuzzy metrics for 
# other ways of measuring similarity



In [37]:

    
# quick demo:
item1 = "hollyoaks"
target = "hollyoaks"
print(target, fuzz.ratio(item1, target))
target = 'holly-oaks'
print(target, fuzz.ratio(item1, target))
target = 'holly oak'
print(target, fuzz.ratio(item1, target))
target = "tv's best moments"
print(target, fuzz.ratio(item1, target))









    



hollyoaks 100
holly-oaks 95
holly oak 89
tv's best moments 15



In [41]:

    
def decision_tvshow_is_hollyoaks_with_smarts(cell):
    cleaned_cell = cell
    cleaned_cell = cleaned_cell.lower() # lowercase
    cleaned_cell = cleaned_cell.strip() # remove superflous whitespace
    is_hollyoaks =  fuzz.ratio("hollyoaks", cleaned_cell) > 85
    return is_hollyoaks

# we can assign the result back to the DataFrame as a new column
df['TVShow'].apply(decision_tvshow_is_hollyoaks_with_smarts)









    Out[41]:





0     True
1     True
2     True
3     True
4     True
5    False
Name: TVShow, dtype: bool



In [42]:

    
df.head(10) # show that we *haven't* yet manipulated the DataFrame









    Out[42]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      True
      Hollyoaks
      True
      True
      NaN
    
    
      1
      0.4
      0.6
      False
      hollyoaks
      True
      False
      NaN
    
    
      2
      0.3
      0.4
      False
      Hollyoaks
      True
      False
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      True
      hollyoaks
      True
      True
      NaN
    
    
      4
      0.9
      0.8
      True
      holly-oaks
      False
      False
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      True
      best TV shows
      False
      False
      NaN



In [45]:

    
df['Decision2'] = df['TVShow'].apply(decision_tvshow_is_hollyoaks_with_smarts)
df.head(10)
# note that we *haven't re-run Decision3*!









    Out[45]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      True
      Hollyoaks
      True
      True
      NaN
    
    
      1
      0.4
      0.6
      False
      hollyoaks
      True
      False
      NaN
    
    
      2
      0.3
      0.4
      False
      Hollyoaks
      True
      False
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      True
      hollyoaks
      True
      True
      NaN
    
    
      4
      0.9
      0.8
      True
      holly-oaks
      True
      False
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      True
      best TV shows
      False
      False
      NaN



In [46]:

    
df['Decision3'] = df.DecisionF1F2 & df.Decision2
df.head(10)









    Out[46]:






  
    
      
      Feature1
      Feature2
      DecisionF1F2
      TVShow
      Decision2
      Decision3
      Comment
    
  
  
    
      0
      0.6
      0.6
      True
      Hollyoaks
      True
      True
      NaN
    
    
      1
      0.4
      0.6
      False
      hollyoaks
      True
      False
      NaN
    
    
      2
      0.3
      0.4
      False
      Hollyoaks
      True
      False
      trailing-space on TVShow
    
    
      3
      0.9
      0.8
      True
      hollyoaks
      True
      True
      NaN
    
    
      4
      0.9
      0.8
      True
      holly-oaks
      True
      True
      badly-spelt TVShow
    
    
      5
      0.9
      0.8
      True
      best TV shows
      False
      False
      NaN



In [44]:

    
# We could copy the Excel code down here and write out a new spreadsheet to disk...



In [ ]:

	Feature1	Feature2	DecisionF1F2	TVShow	Decision2	Decision3	Comment
0	0.6	0.6	NaN	Hollyoaks	NaN	NaN	NaN
1	0.4	0.6	NaN	hollyoaks	NaN	NaN	NaN
2	0.3	0.4	NaN	Hollyoaks	NaN	NaN	trailing-space on TVShow
3	0.9	0.8	NaN	hollyoaks	NaN	NaN	NaN
4	0.9	0.8	NaN	holly-oaks	NaN	NaN	badly-spelt TVShow
5	0.9	0.8	NaN	best TV shows	NaN	NaN	NaN

	Feature1	Feature2	DecisionF1F2	TVShow	Decision2	Decision3	Comment
0	0.6	0.6	True	Hollyoaks	NaN	NaN	NaN
1	0.4	0.6	False	hollyoaks	NaN	NaN	NaN
2	0.3	0.4	False	Hollyoaks	NaN	NaN	trailing-space on TVShow
3	0.9	0.8	True	hollyoaks	NaN	NaN	NaN
4	0.9	0.8	True	holly-oaks	NaN	NaN	badly-spelt TVShow
5	0.9	0.8	True	best TV shows	NaN	NaN	NaN